Web Scraping Whoopsies

Manu Alcalá + Jameson Carter

2023-10-18

What is web scraping?

Broadly, it is the process of automatically pulling data from websites by reading their underlying code. Doing this gets complicated fast:

  • Static websites contain information in HTML code which does not change when you, the user, interact with it.
  • Dynamic websites have information in HTML code which does change as you interact with it.
  • Static and dynamic websites require different packages to scrape.
    • For example, selenium is a common dynamic web scraping library and scrapy is a common static web scraping library.
    • This is because a static scraper simply has to understand what it is looking for in the HTML. A dynamic scraper needs to simulate a human interacting with the site.
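To make the static case concrete: a static scraper only has to locate the right elements in HTML that is already fully there. A minimal stdlib sketch (the table snippet and figures are hypothetical; a real project would more likely use scrapy or BeautifulSoup, and a dynamic page would first need a tool like selenium to render it):

```python
from html.parser import HTMLParser

# A static scraper just reads the raw HTML and picks out what it wants.
# Here we pull every <td> cell out of a small, made-up enrollment table.
class CellExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

html = "<table><tr><td>Louisiana</td><td>1,800,000</td></tr></table>"
parser = CellExtractor()
parser.feed(html)
print(parser.cells)  # → ['Louisiana', '1,800,000']
```

A dynamic scraper cannot stop here: it has to click, scroll, or wait until the HTML above actually exists in the page.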

When is it worth it?

  • Other data collection efforts are impossible or prone to human error
    • Creating reproducible and reliable processes eases collaboration / quality assurance
  • Getting new data releases quickly is valuable
  • Time spent coding < time spent obtaining data in other ways + time spent on quality assurance

What makes it hard?

  • It is difficult to predict how much time a web scraping task will take.

  • Sites might change, introducing the need to update your code.

    • Sites might be removed, stop being maintained, or completely overhauled.
  • Site maintainers may not be okay with data being scraped. Quick plug for the TECH team’s Automated Data Guidelines.

Example 1. Scraping Medicaid enrollment data

Why Medicaid enrollment data?

Since Spring 2023, when the Public Health Emergency ended, states have been disenrolling Medicaid beneficiaries who no longer qualify.

Why are the data interesting?

In anticipation of “the great unwinding,” many states implemented policy changes to smooth the transition.

To understand the success of these policies, we wanted time-series enrollment data for all 50 states… from a Medicaid data system that is largely decentralized.

Unreadable PDFs abound!

An example from Louisiana

and another from Ohio

A sigh of relief…

Why page through PDFs when another organization’s RAs can do it for you?

1. Identify that this is a scrape-able dynamic page

One URL with data you can only get by clicking each option!

2. Confirm HTML actually contains the data
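One quick way to confirm this: take a value you can read off the rendered page and check that it appears in the page source. A toy sketch (the snippet stands in for what selenium's driver.page_source would return after clicking an option; the enrollment figure is hypothetical):

```python
# Sanity check: does the number you see on screen actually live in the HTML,
# or is it baked into an image/canvas that a parser will never find?
# In the real workflow, page_source would be driver.page_source from selenium.
page_source = """
<div class="tooltip">Medicaid enrollment, June 2023: <span>92,300,000</span></div>
"""

visible_value = "92,300,000"  # a figure read off the rendered dashboard
print(visible_value in page_source)  # → True
```

If this check fails, the data are probably drawn client-side into a chart image, and step 3 is going to take a lot longer than 30 hours.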

3. Code for 30 hours!

4. Bask in the glow of automated scraping

Whenever new data were released in the following 2 months, I re-ran the code and got a well-formatted Excel file as output.

Little did I know, trouble was coming

What happened?

Two months later, KFF stopped updating the dashboard and changed how the existing data were reported on its graphs.

Example 2. Scraping Course Catalogs

Why course catalog data?

  • States are increasingly interested in work-based learning (WBL) as a strategy for helping students prepare for and access good jobs, but measurement of WBL has been limited

  • To understand the prevalence and types of WBL, we wanted course-level data from community colleges across the country

Not all catalogs are the same

An example of course descriptions listed under department pages

And an example containing links to course catalogs in .pdf format

Web crawling adventures

  • Scrapy is a web scraping and web crawling framework for extracting structured data from websites
  • It uses spiders, which are self-contained web crawlers that follow links according to your instructions
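The core mechanic a spider automates is harvesting the links on each page so the crawler knows where to go next (in Scrapy proper this happens inside a spider's parse() callback, typically via response.follow()). A stdlib illustration of that mechanic, using a hypothetical department page:

```python
from html.parser import HTMLParser

# The heart of crawling: pull every href out of a page so the crawler
# can queue up the next pages to visit.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A made-up department page linking to two course catalogs.
page = '<a href="/catalog/math.pdf">Math</a> <a href="/catalog/bio.pdf">Bio</a>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # → ['/catalog/math.pdf', '/catalog/bio.pdf']
```

Scrapy layers scheduling, politeness delays, deduplication, and structured-item pipelines on top of this loop, which is why it is the tool of choice for crawling many catalogs at once.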

Scrapy go brrrrr…

Whoopsie…

A new direction

Selenium and BeautifulSoup to the rescue!
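The rescue pattern: let selenium drive a real browser so JavaScript-rendered content actually exists, then hand the rendered HTML to a parser. In the real pipeline the string below would be driver.page_source and the parsing would be done with BeautifulSoup; here a stdlib regex stands in on a hypothetical rendered course entry, just to show the hand-off:

```python
import re

# Step 1 (not run here): selenium renders the page.
#   driver.get(url)
#   rendered = driver.page_source
# Step 2: parse the rendered HTML. BeautifulSoup would do this robustly;
# a regex stand-in works for this single, well-formed snippet.
rendered = '<div class="course"><h3>BIO 101</h3><p>Intro to biology.</p></div>'

title = re.search(r"<h3>(.*?)</h3>", rendered).group(1)
description = re.search(r"<p>(.*?)</p>", rendered).group(1)
print(title, "-", description)  # → BIO 101 - Intro to biology.
```

Splitting the work this way keeps the slow part (browser automation) separate from the fiddly part (parsing), so each can be debugged on its own.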

Concluding remarks

Core questions to explore before you start

Availability of data

  • Are the data available through other routes?

  • Are the data produced by an organization that is invested in the problem long-term?

Frequency of scraping

  • Will I need to scrape the data multiple times?

  • What is the risk that the item scraped from the site will be changed?

Time-value tradeoffs

  • Is the time spent coding worth the payoff?

  • Will collecting data automatically save time on quality assurance?

Questions?

The remainder of the time is reserved for group discussion!

Thank you!

Please contact Manu Alcalá or Jameson Carter if you would like to discuss either of these projects or scope whether a use case is reasonable.